import pandas as pd
pd.options.mode.chained_assignment = None
import plotly.express as px
import numpy as np
import wbdata
import json
import requests
import plotly.io as pio
pio.renderers.default='notebook'
I will be using the Aquastat databases which is provided by the Food and Agriculture Organization of the United Nations. The database provides information about water and land resources for several countries. These can be grouped by region or income levels.
I will be using the Aquastat databases which is provided by the Food and Agriculture Organization of the United Nations. The entire dataset takes approximatley 1 TB in disk space. As such, I used the interactive tool on the FAO website to download a subset of the overall repository. In particular, I selected population data; the prevelance and population of people undernourished, to be used as a proxy for poverty; and data about access to sage drinking water. The data has been restricted to the last 4 years.
I will begin by reading the data into pandas:
source = pd.read_csv("./aquastat1.csv",low_memory=False)
source.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2800 entries, 0 to 2799 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Area 2800 non-null object 1 Area Id 2800 non-null int64 2 Variable Name 2800 non-null object 3 Variable Id 2800 non-null int64 4 Year 2800 non-null int64 5 Value 2800 non-null float64 6 Symbol 2800 non-null object dtypes: float64(1), int64(3), object(3) memory usage: 153.2+ KB
source.head()
| Area | Area Id | Variable Name | Variable Id | Year | Value | Symbol | |
|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 4 | Total area of the country (excl. coastal water) | 4100 | 2018 | 65286.000000 | E |
| 1 | Afghanistan | 4 | Arable land area | 4101 | 2018 | 7794.000000 | X |
| 2 | Afghanistan | 4 | Permanent crops area | 4102 | 2018 | 216.000000 | X |
| 3 | Afghanistan | 4 | Cultivated area (arable land + permanent crops) | 4103 | 2018 | 8010.000000 | X |
| 4 | Afghanistan | 4 | % of total country area cultivated | 4470 | 2018 | 12.269093 | E |
I will also use data from the WorldBank through the Wbdata interface, to obtain information about each country, such as income level and region information. There is a large amount of data availble from this dataset using this interface.
countries = wbdata.get_country()
countries = pd.json_normalize(countries)
countries.head()
| id | iso2Code | name | capitalCity | longitude | latitude | region.id | region.iso2code | region.value | adminregion.id | adminregion.iso2code | adminregion.value | incomeLevel.id | incomeLevel.iso2code | incomeLevel.value | lendingType.id | lendingType.iso2code | lendingType.value | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABW | AW | Aruba | Oranjestad | -70.0167 | 12.5167 | LCN | ZJ | Latin America & Caribbean | HIC | XD | High income | LNX | XX | Not classified | |||
| 1 | AFE | ZH | Africa Eastern and Southern | NA | NA | Aggregates | NA | NA | Aggregates | Aggregates | ||||||||
| 2 | AFG | AF | Afghanistan | Kabul | 69.1761 | 34.5228 | SAS | 8S | South Asia | SAS | 8S | South Asia | LIC | XM | Low income | IDX | XI | IDA |
| 3 | AFR | A9 | Africa | NA | NA | Aggregates | NA | NA | Aggregates | Aggregates | ||||||||
| 4 | AFW | ZI | Africa Western and Central | NA | NA | Aggregates | NA | NA | Aggregates | Aggregates |
I will now show the data from FAO for Andorra as an example.
source[source["Area"]=='Andorra']
| Area | Area Id | Variable Name | Variable Id | Year | Value | Symbol | |
|---|---|---|---|---|---|---|---|
| 45 | Andorra | 20 | Total area of the country (excl. coastal water) | 4100 | 2018 | 47.000000 | E |
| 46 | Andorra | 20 | Arable land area | 4101 | 2018 | 0.830000 | X |
| 47 | Andorra | 20 | Permanent crops area | 4102 | 2018 | 0.000000 | I |
| 48 | Andorra | 20 | Cultivated area (arable land + permanent crops) | 4103 | 2018 | 0.830000 | X |
| 49 | Andorra | 20 | % of total country area cultivated | 4470 | 2018 | 1.765957 | E |
| 50 | Andorra | 20 | Total population | 4104 | 2018 | 77.006000 | X |
| 51 | Andorra | 20 | Rural population | 4105 | 2018 | 9.187000 | X |
| 52 | Andorra | 20 | Urban population | 4106 | 2018 | 67.766000 | X |
| 53 | Andorra | 20 | Population density | 4107 | 2018 | 163.842553 | E |
| 54 | Andorra | 20 | Total population with access to safe drinking-... | 4114 | 2018 | 100.000000 | I |
| 55 | Andorra | 20 | Rural population with access to safe drinking-... | 4115 | 2018 | 100.000000 | I |
| 56 | Andorra | 20 | Urban population with access to safe drinking-... | 4116 | 2018 | 100.000000 | I |
I will now convert the data to wide format so that each row represents a country.
source_wide = source.groupby(["Area","Variable Name"])["Value"].mean().unstack()
source_wide.reset_index(inplace=True)
source_wide.head()
| Variable Name | Area | % of cultivated land irrigated [harvested crop] | % of total country area cultivated | Arable land area | Cultivated area (arable land + permanent crops) | Number of people undernourished (3-year average) | Permanent crops area | Population density | Prevalence of undernourishment (3-year average) | Rural population | Rural population with access to safe drinking-water (JMP) | Total area of the country (excl. coastal water) | Total population | Total population with access to safe drinking-water (JMP) | Urban population | Urban population with access to safe drinking-water (JMP) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 27.166042 | 12.269093 | 7794.000 | 8010.000 | 8300.0 | 216.000 | 56.937048 | 23.0 | 27099.874 | 47.0 | 65286.0 | 37171.921 | 55.3 | 9273.302 | 78.2 |
| 1 | Albania | 33.390805 | 25.099171 | 611.346 | 696.000 | 100.0 | 84.654 | 103.957447 | 4.1 | 1164.388 | 95.2 | 2773.0 | 2882.740 | 95.1 | 1769.975 | 94.9 |
| 2 | Algeria | 10.076298 | 3.575963 | 7505.017 | 8517.017 | 1100.0 | 1012.000 | 17.730059 | 2.7 | 11498.038 | 81.8 | 238174.1 | 42228.408 | 83.6 | 30510.016 | 84.3 |
| 3 | Andorra | NaN | 1.765957 | 0.830 | 0.830 | NaN | 0.000 | 163.842553 | NaN | 9.187 | 100.0 | 47.0 | 77.006 | 100.0 | 67.766 | 100.0 |
| 4 | Angola | 0.220901 | 4.183043 | 4900.000 | 5215.000 | 4600.0 | 315.000 | 24.713072 | 15.4 | 10612.667 | 28.2 | 124670.0 | 30809.787 | 49.0 | 20161.538 | 75.4 |
I performed an inner join on the data between the two sources. Please note, due to differences in naming the join has resulted in data from some contries being dropped.
new = pd.merge(source_wide, countries, left_on="Area", right_on="name",how="left")
new = new.sort_values("region.value")
new.head()
| Area | % of cultivated land irrigated [harvested crop] | % of total country area cultivated | Arable land area | Cultivated area (arable land + permanent crops) | Number of people undernourished (3-year average) | Permanent crops area | Population density | Prevalence of undernourishment (3-year average) | Rural population | ... | region.value | adminregion.id | adminregion.iso2code | adminregion.value | incomeLevel.id | incomeLevel.iso2code | incomeLevel.value | lendingType.id | lendingType.iso2code | lendingType.value | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 92 | Kiribati | NaN | 41.975309 | 2.0 | 34.0 | NaN | 32.0 | 143.020988 | 3.5 | 54.403 | ... | East Asia & Pacific | EAP | 4E | East Asia & Pacific (excluding high income) | LMC | XN | Lower middle income | IDX | XI | IDA |
| 110 | Marshall Islands | NaN | 47.222222 | 2.0 | 8.5 | NaN | 6.5 | 324.516667 | NaN | 12.212 | ... | East Asia & Pacific | EAP | 4E | East Asia & Pacific (excluding high income) | UMC | XT | Upper middle income | IDX | XI | IDA |
| 177 | Timor-Leste | 14.782609 | 15.467384 | 155.0 | 230.0 | 300.0 | 75.0 | 85.270612 | 24.0 | 919.214 | ... | East Asia & Pacific | EAP | 4E | East Asia & Pacific (excluding high income) | LMC | XN | Lower middle income | IDB | XH | Blend |
| 176 | Thailand | 34.664477 | 41.530246 | 16810.0 | 21310.0 | 5400.0 | 4500.0 | 135.306464 | 7.7 | 34627.095 | ... | East Asia & Pacific | EAP | 4E | East Asia & Pacific (excluding high income) | UMC | XT | Upper middle income | IBD | XF | IBRD |
| 61 | Fiji | 1.600000 | 13.683634 | 165.0 | 250.0 | NaN | 85.0 | 48.357033 | 6.3 | 399.125 | ... | East Asia & Pacific | EAP | 4E | East Asia & Pacific (excluding high income) | UMC | XT | Upper middle income | IDB | XH | Blend |
5 rows × 34 columns
new["% Safe Water"] = new["Total population with access to safe drinking-water (JMP)"]
As can be seen in the parallell catagories plot, lower income and lower middle income countries have lesser access to safe- drinking water. Sub-Sahara Africa is has the least access to safe drinking water.
fig = px.parallel_categories(new, dimensions=['Area', 'incomeLevel.value','region.value'],
color='% Safe Water',
width=1000, height=2000)
fig.update(layout_showlegend=False)
fig.show()
Unfortunatley due to the high amount of missing data, it is hard to see the connection between undernourishment and income level of a country. Although, the brighter yellow corresponds to a higher prevalance of undernourshment. This color is mostly present in Sub-Saharan Africa.
fig = px.parallel_categories(new, dimensions=['region.value', 'incomeLevel.value'],
color='Prevalence of undernourishment (3-year average)')
fig.update(layout_showlegend=False)
fig.show()
A bubble chart shows the prevalance of undernourshiment against the % of population with access to safe water.
fig = px.scatter(new.dropna(),
x="Prevalence of undernourishment (3-year average)", y="% Safe Water",
size="Total population", color="region.value",
hover_name="Area", size_max=60
)
fig.show()
fig = px.scatter(new.dropna(),
x="Prevalence of undernourishment (3-year average)", y="% Safe Water",
size="Total population", color="incomeLevel.value",
hover_name="Area", size_max=60
)
fig.show()
new["%land irrigated"] = new["% of cultivated land irrigated [harvested crop]"]
new["%cultivated"] = new["% of total country area cultivated"]
new["%undernourishment"] = new["Prevalence of undernourishment (3-year average)"]
new["%rural access to water"] = new["Rural population with access to safe drinking-water (JMP)"]
new["%urban access to water"] = new["Urban population with access to safe drinking-water (JMP)"]
new["%total access to water"] = new["Total population with access to safe drinking-water (JMP)"]
fig = px.scatter_matrix(new.dropna(),
dimensions=["%land irrigated",
"%cultivated",
"%undernourishment",
"%rural access to water",
"%urban access to water",
"%total access to water",
],
color="incomeLevel.value")
fig.update_traces(diagonal_visible=False)
fig.update_traces(showupperhalf=False)
fig.update_layout(
font=dict(
size=5,
))
fig.show()